(57) Abstract

A data switch for handling packets of information comprises input traffic managers, ingress routers, a memoryless cyclic
switch fabric, egress routers and output traffic managers all acting under the control of a switch controller. Each ingress router includes a
set of virtual output buffers, one for each output traffic manager and each message priority. Each data packet or cell as it arrives is examined
to identify the output traffic manager address and its message priority. The switch controller uses a first arbitration and selection process
to schedule the passage of the next cell across the switch fabric, while the ingress router uses a second arbitration and selection process to
select the appropriate virtual output queue for use in the switch fabric transfer.
which means that the storage devices must be very high performance. Hence, at
very high data rates current technology limits the use of output buffers.

It is an object of the invention to provide a data switching method and
apparatus for the more efficient handling of packets of information through a data
switch.

According to a first aspect of the invention there is provided a method
of handling packets of information through a data switch comprising input traffic
controllers, ingress routers, a memoryless cyclic switch fabric, egress routers and
output traffic controllers all under the control of a switch controller and
interconnected such that each input line connected to the data switch is terminated
on a traffic controller arranged to convert the input line protocol information
packets into fixed length cells having a header defining the data switch destination
router and output traffic controller together with message priority information
arranged such that each ingress router serves a group of traffic controllers
characterised in that the ingress router includes a set of input buffers one for each
input line and a set of virtual output queue buffers, one for each output traffic
controller from the data switch, and in which the method comprises on the arrival
of a cell from a traffic controller the ingress router examines the cell header and
places it in the appropriate virtual output queue and generates a request for transfer
message consisting of the destination traffic controller address and a message
priority code which is passed to the data switch controller, the switch controller
schedules the passage of the cells across the switch fabric by interconnecting a
specific ingress router to a specific egress router for each switch fabric cycle in
accordance with a first arbitration process the ingress router selecting from the
appropriate virtual output queue the cell at the head of the queue for passage across
the data switch to the appropriate output traffic controller in accordance with a
second arbitration process.

According to a second aspect of the invention there is provided a data
switch for handling packets of information comprising input traffic controllers,
ingress routers, a memoryless cyclic switch fabric, egress routers and output traffic
controllers all under the control of a switch controller and interconnected such that
each input line connected to the data switch is terminated on a traffic controller
arranged to convert the input line protocol information packets into fixed length
cells having a header defining the data switch destination router and output traffic
controller together with message priority information arranged such that each
ingress router serves a group of traffic controllers characterised in that the ingress
router includes a set of input buffers one for each input line and a set of virtual
output queue buffers, one for each output traffic controller connected to the data
switch, and in which on the arrival of a cell from a traffic controller the ingress
router examines the cell header and places it in the appropriate virtual output queue
and generates a request for transfer message consisting of the destination traffic
controller address and a message priority code which is passed to the data switch
controller, the switch controller schedules the passage of the cells across the switch
fabric by interconnecting a specific ingress router to a specific egress router for
each switch fabric cycle in accordance with a first arbitration process and the
ingress router selects from the appropriate virtual output queue the cell at the head
of the queue for passage across the data switch to the appropriate output traffic
controller in accordance with a second arbitration process.



The invention together with its various features will be more readily
understood from the following description which should be read in conjunction
with the accompanying drawings, in which:-
Fig. 1 shows a generalised concept of the prior art,

Fig. 2 shows in block diagram form one embodiment of the data switch
of the invention,

Fig. 3 shows the switch fabric of the embodiment of the invention,

Fig. 4 shows the flow of data through the switch fabric,

Fig. 5 shows ATM frame headers when passing through the switch
fabric,

Fig. 6 shows Ethernet frame headers when passing through the switch
fabric,

Fig. 7 shows the scheduling and arbitration arrangements of the data
switch,

Fig. 8 shows an egress backpressure broadcast,

Fig. 9 shows the switch block diagram,

Fig. 10 shows the detail of the switch block,

Fig. 11 shows a block diagram of the master according to the
embodiment of the invention,

Fig. 12 shows a block diagram of a router according to the
embodiment of the invention, whilst

Fig. 13 shows the queue structure.

Referring now to Figure 1, this shows the general concept of a data
switch. Inputs N1 to Nn are connected to respective input ports IP1 to IPn of a data
switch SW. The switch has output ports OP1 to OPn connected to respective
outputs M1 to Mn.



With intelligent distributed scheduling mechanisms it is possible to
create an input buffered switch which meets the same traffic shaping efficiency of
its output buffered counterpart. The use of input buffers is preferred for several
reasons. Input buffering requires smaller buffers, which can have relatively low
performance and therefore be cheaper.

When cells are queued at the input there is the possibility of
contention arising through the phenomenon of Head Of Line (HOL) blocking. This
generally occurs when First In First Out (FIFO) queue mechanisms are used. With
a FIFO, the cell at the head of the queue is the only one that can be
chosen for delivery through the switch. Now, consider the case where an input port
has three cells c1, c2, c3 stored such that c1 is at the head of the queue with c2
stored next and c3 last, with cell c1 destined for port N and cell c2 destined for port
N+1. Now port N is already connected to port N-1, therefore c1 cannot be
switched; however, port N+1 is unconnected and therefore c2 could actually be
delivered. However, c2 cannot get out of the FIFO because it is blocked by the
HOL cell, i.e. c1. An intelligent approach to the solution of HOL blocking is the
concept of Virtual Output Queues (VOQ). Using VOQs the cells are separated out
at the input into queues which map directly to their required output destination.
They can therefore be effectively described as being output queues which are held
at the input, i.e. Virtual Output Queues. Since the cells are now separated out in
terms of their output destination they can no longer be blocked by the HOL
phenomenon.
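The c1/c2/c3 example can be illustrated with a minimal sketch. It is not part of the specification: the function names, the numeric value chosen for N and the destination of c3 (which the text does not state) are assumptions made for illustration only.

```python
from collections import deque

def deliverable_fifo(fifo, busy_outputs):
    """Return the only cell a plain FIFO can offer this cycle, or None if blocked."""
    if fifo and fifo[0][1] not in busy_outputs:
        return fifo[0]
    return None

def deliverable_voq(voqs, busy_outputs):
    """Return the head cell of every virtual output queue whose output is free."""
    return [q[0] for out, q in sorted(voqs.items()) if q and out not in busy_outputs]

N = 5                                           # arbitrary port number (assumption)
cells = [("c1", N), ("c2", N + 1), ("c3", N)]   # c3's destination is assumed
busy = {N}                                      # port N is already connected elsewhere

fifo = deque(cells)
voqs = {}
for name, out in cells:
    voqs.setdefault(out, deque()).append((name, out))

fifo_choice = deliverable_fifo(fifo, busy)      # c1 blocks the FIFO, nothing can go
voq_choice = deliverable_voq(voqs, busy)        # c2 is reachable through its own VOQ
```

With the single FIFO nothing can be sent because c1 is at the head, whereas with one queue per output c2 remains eligible for delivery to port N+1.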



There is also the question of Quality Of Service (QoS) to address.
Different input sources have different requirements in terms of how their data
should be delivered. For example voice data must be guaranteed a very tightly
controlled delivery service whereas the handling of computer data can be more
relaxed. To accommodate these requirements the concept of priority can be used.
Data is given a level of priority, which changes the way the switch deals with it.
For example consider two cells in different VOQs c1 and c2 which are both
Referring now to Figure 2, the main feature is the data switch SW.
Inputs are provided to the switch from ingress traffic manager units ITM0 to ITMn.
Each ingress traffic manager may have one or more input line end devices (ILE)
connected to it. Outputs from the switch SW are connected by way of egress traffic
manager units ETM0 to ETMn to egress line end devices (ELE).

The traffic manager units (ITM and ETM) provide the protocol-specific
processing in the switch, such as congestion buffering, ingress traffic
policing, address translation (ingress and egress) and routing (ingress), traffic
shaping (ingress or egress), collection of traffic statistics and line level diagnostics.
There may also be some segmentation and re-assembly functionality within a
traffic manager unit. The line end devices (ILE and ELE) are full-duplex devices
and provide the switch port physical interfaces. Typically, line end devices will be
operated in synchronous transfer mode, ranging from OC-3 to OC-48 rates or
10/100 and Gigabit Ethernet.



The switch SW provides the application independent, loss-less
transport of data between the traffic managers based on routing information
provided by the traffic managers and the connection allocation policies determined
by the switch control SC. This controls the global functions of the switch such as
connection management, switch level diagnostics, statistics collection and
redundancy management.



The switching system just described is based on an input-queued
non-blocking crossbar architecture. A combination of adequate buffering, hierarchic
flow control, and distributed scheduling and arbitration processes ensures loss-less,
efficient, and high performance switching capabilities. It should be noted that the
ingress and egress functions are shown separately on either side of the drawing.
In reality, traffic manager units, line-end devices, ingress and egress ports may be
considered full duplex.
Fig. 3 shows the basic architecture of the switch according to one
embodiment of the invention. The ingress traffic manager units ITM described
above connect streams of data to a number of ingress routers SRI0 to SRIp. These
routers are connected to the switching matrix SCM, which is itself controlled by
a switch controller SM. Data outputs from the switching matrix SCM are passed
by egress routers SRE0 to SREp and on to the egress traffic manager units ETM.

The ingress routers SRI0 to SRIp on the ingress side collect data
streams from the ingress traffic manager units ITM, request connections across the
switching matrix SCM from the controller SM, queue up data packets (referred to as
"tensors") until the controller SM grants a connection and then send the data to the
switching matrix SCM. On the egress side, the egress routers SRE0 to SREp sort
data packets into the relevant data streams and forward them to the appropriate
egress traffic manager units ETM. Each ingress and egress router communicates
on a point to point basis with two traffic manager units over a common switch
interface. Each interface is 32 bits wide (full duplex) and can operate at either 50
or 100MHz. Through their common interfaces the routers can support up to 5Gbs of
cell-based traffic such as ATM or 4Gbs of packet based traffic such as gigabit
Ethernet. These 4 or 5Gbs of data share a small amount of external memory.



The switch controller SM takes connection requests from the ingress
routers and creates sets of connections in the switching matrix SCM. The
controller SM arbitration mechanisms maximise the efficiency of the switching
matrix SCM while maintaining fairness of service to the routers. The controller SM is able
to configure one-to-one (unicast) and one-to-all (broadcast) connections in the
switching matrix SCM. The controller SM selects an optimal combination of
connections to establish in the matrix SCM once per switching cycle. The selection
can be postponed by one (or more) backpressure broadcast requests that are
satisfied in a round-robin fashion before allowing normal operation to resume. The
arbiter also uses a probabilistic work-conserving algorithm to allocate bandwidth
in the switching matrix to each priority according to information defined by the
external system controller.
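The specification does not give the details of the probabilistic work-conserving allocation. The following is only a sketch of one way such a scheme could behave, assuming per-priority shares supplied by the external system controller; the function name pick_priority and the data layout are hypothetical. A priority with nothing waiting is never chosen, so its share is redistributed among the others.

```python
import random

def pick_priority(shares, pending, rng):
    """Choose a priority at random, in proportion to its configured share.

    shares:  {priority: bandwidth share} (assumed to come from the system controller)
    pending: {priority: number of waiting connection requests}
    Only priorities with pending requests take part, so no cycle is wasted
    on an idle priority (the work-conserving property).
    """
    active = [p for p in sorted(shares) if pending.get(p, 0) > 0]
    if not active:
        return None
    weights = [shares[p] for p in active]
    if sum(weights) == 0:
        return active[0]          # all shares zero: still serve something
    return rng.choices(active, weights=weights, k=1)[0]
```

For example, with shares of 0.4, 0.3, 0.2 and 0.1 for four priorities and requests pending only on the first and last, the first priority would be picked in roughly 80% of cycles.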

The switching matrix SCM itself consists of a number of memory-less,
non-blocking matrix planes SCM 1 - N and a number of embedded serial
transceivers to interface to the routers. The number of matrix planes in a particular
switch depends on the core throughput required across the matrix. The core
throughput will be greater than the aggregate of the external interfaces to allow for
inter-router communication, core header overheads and maximal connections
during the arbitration cycles. The device is packaged with two planes of sixteen
ports, which can be configured to provide an alternative number of planes/ports.
The multiple serial links that comprise the data path between the router and
switching matrix are switched simultaneously and therefore act as a single full
duplex fat pipe of 8Gbps. The switching matrix has the novel feature that it can be
configured as an 'NxN' port crossbar device where N can be 4, 8 or 16. This feature
can increase the number of planes per package and therefore allows a wide range
of systems to be realised cost effectively. For example using the first generation
chip set systems of less than 20Gbs up to 80Gbs can be easily configured.

Underlying the management of the system is the fabric management
interface FMI, which provides an external orthogonal interface into all of the
system devices. This level of management provides read/write access to a chosen
subset of important registers and RAMs while the device is functioning normally,
and will provide access to all the registers and RAMs in a device (using the scan
mechanism) if the system is inoperable. Management access can be used for the
purposes of system initialisation and dynamic reconfiguration. The following
features need to be configured via the fabric management interface FMI at system
reset but can be modified on a live system: ingress router queue parameter
sensitivities, ingress and egress queue thresholds, bandwidth allocation tables in the
routers and switch controller and status information. Each device has a primary
status register, which can be read to obtain the high-level view of the device status;
for example, the detection of non-critical failures. If necessary, more detailed
status registers can then be accessed.

If a device or the total system fails, fabric interface management access into
the chip-set is still possible. This will normally provide useful information in
diagnosing the fault. It can also be used to perform low-level testing of the
hardware.



Detailed error management facilities have been built into the system. The
management of errors can be considered under the headings of detection,
correction, containment and reporting as described below:



a) Detection. Within the system, all interfaces between devices are checked
as follows:- parallel interfaces between devices are protected by parity. Serial data
being routed from one router to another via the switching matrix is protected by a
sixteen bit cyclic redundancy code. This is generated in the ingress router and
forms part of the tensor. It is checked and discarded at the egress router. All
external interfaces support parity and the common switch interface specification
includes optional parity. This is implemented at the system end of the interface.
There are certain units within the system that are common to all devices.
The two units of most interest are the Central Management Unit and the Fabric
Management Interface.



Data is handled throughout the system in fixed length cells. There are
several reasons for using fixed length cells, one of which is that the quality of
service (QoS) is easier to guarantee when the switch is reconfigured after every
switch cycle. In addition, the packet latency is improved for both long and short
packets and the buffer management is simplified. In practice, there are slight
variations in the format of the cells, due to the need to include steering information
in headers at various points. Figure 4 shows the flow of data through the switch
fabric and the functions performed by the seven steps shown in the diagram are
detailed below:-

Firstly, packets received from a line end are, where necessary, segmented
in an ingress traffic manager ITM and formed into cells of the correct format to be
transferred over the common interface, denoted in Figure 4 as CSIX;

I 

Secondly, at the ingress router SRI, arriving cells are examined and placed
in the appropriate queue. There are several sets of queues, shown here as unicast
queues UQ, multicast queues MQ and broadcast queues BQ. In the diagram the
cell has been placed into one of the unicast queues;

Thirdly, the arrival of a cell triggers a 'request for transfer' RFT to the
controller SM. The cell will be held in the queue until this request is granted;

At step 4, the controller SM executes an arbitration process and determines
the maximal connection set that can be established within the switching matrix
SCM for the next switching cycle. It then grants the 'request to transfer' RTT and
signals the egress router SRE that it must expect a cell.

At the fifth step, the ingress router SRI, having been granted a connection,
also executes an arbitration process to determine which cell will be transferred.
The cell is transferred through the memory-less switching matrix SCM and into a
buffer in the egress router SRE.

As shown at step 6, there is one egress buffer per egress traffic manager
ETM and arriving cells are examined and placed in the appropriate traffic manager
queue in the egress router SRE.

Finally, at step 7, the cell is transferred to the egress traffic manager ETM
over the standard interface CSIX and, where necessary, re-assembled into a packet
before onward transmission.

The transfer of data through the system is packaged in cells termed tensors.
An arbitration cycle transfers one tensor per router through the switching matrix
SCM. Each tensor consists of 6 or 8 vectors. A vector consists of one byte per
plane of the switching matrix and is transferred through it in one system clock
cycle. The sizes of the vector and tensor for a particular application are determined
by the bandwidth required in the fabric and the most appropriate cell size. The
following sections show the typical packaging of the data as it flows through the
system for ATM and Ethernet.
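The relationship between vectors and tensors can be checked with a short sketch. It uses the figures quoted in this description for ATM (6 vectors of 10 bytes) and Ethernet (8 vectors of 10 bytes); the helper names are illustrative and not taken from the specification.

```python
def tensor_size(vectors_per_tensor, bytes_per_vector):
    """Tensor length in bytes: each vector carries one byte per matrix plane."""
    return vectors_per_tensor * bytes_per_vector

def split_into_vectors(tensor, bytes_per_vector):
    """Cut a tensor into the vectors sent on successive system clock cycles."""
    if len(tensor) % bytes_per_vector != 0:
        raise ValueError("tensor length must be a whole number of vectors")
    return [tensor[i:i + bytes_per_vector]
            for i in range(0, len(tensor), bytes_per_vector)]

atm_tensor = tensor_size(6, 10)        # 60 bytes
ethernet_tensor = tensor_size(8, 10)   # 80 bytes
atm_vectors = split_into_vectors(bytes(atm_tensor), 10)
```

Since one vector crosses the matrix per system clock cycle, a 60-byte tensor occupies six clock cycles and an 80-byte tensor eight.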

As shown in Figure 5a, illustrating the ATM application, payload cells
P containing fifty three bytes of data arriving from an ingress traffic manager ITM
across the interface CSIX are repackaged into 60-byte tensors (6 vectors of 10
bytes). The ingress router analyses the CSIX header UH and wraps the CSIX
packet with the core header CH to create a 60-byte tensor UCT in an ingress queue.
When the controller SM grants the required connection the tensor passes through
the switching matrix SCM in one switch cycle to the egress router which writes the
unicast tensor UT into the egress queue indicated in the core header. When the
tensor reaches the head of the egress queue, the core header is stripped off and the
remaining CSIX packet is sent to the egress traffic manager.



WO 00/38375 PCT/GB99/03748

If the CSIX frame type indicates a multicast packet MT as shown in Figure
5b, the ingress router strips out the multicast mask MM and replicates the packet
into the indicated ingress queues, modifying the target field for each copy as
appropriate. The flow then proceeds as for unicast, except that the tensor is written
simultaneously into multiple egress buffers after passing through the switching
matrix.



In the case of Ethernet or variable length packets as shown in Figure 6, an
ingress traffic manager ITM using segmentation and reassembly functionality
(SAR) converts the variable length packets VLP into CSIX packets at ingress,
embedding the SAR header in the payload. CSIX packets are then transported
through the system in the same way as for the ATM example of Figure 5, except
that the tensor size is set to 80 bytes (8 vectors of 10 bytes) allowing up to 70 bytes
of Ethernet frame to be carried in a single segmented packet. Note that the
segmentation header is considered as private to the traffic managers and is shown
for illustrative purposes only. The system treats it transparently as part of the
payload. The CSIX interface description allows for truncated packets, that is, if a
traffic manager sends a payload that would not fill a tensor it can send a shortened
CSIX packet. The ingress router stores the short packet in the ingress queues (on
fixed tensor boundaries). Any part of the tensor queue that is not used is filled with
INVALID bytes. The fixed size tensors will then have the INVALID bytes
discarded at the egress router.
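The handling of truncated packets can be sketched as follows. The specification does not define how INVALID bytes are encoded or how the egress router recognises them, so the marker value 0xFF, the explicit length arguments and the function names below are all assumptions made purely for illustration.

```python
INVALID = 0xFF  # hypothetical marker value; the real encoding is not given

def build_tensor(core_header, csix_packet, tensor_size):
    """Wrap a (possibly shortened) CSIX packet and pad it to a fixed tensor size."""
    body = core_header + csix_packet
    if len(body) > tensor_size:
        raise ValueError("packet does not fit in one tensor")
    return body + bytes([INVALID]) * (tensor_size - len(body))

def unpack_tensor(tensor, core_header_len, csix_len):
    """Egress side: drop the core header and the padding, keep the CSIX packet.

    The packet length is passed in explicitly here (an assumption); a real
    device would take it from header fields or per-byte validity flags.
    """
    return tensor[core_header_len:core_header_len + csix_len]

tensor = build_tensor(b"\x01\x02", b"abc", 8)
recovered = unpack_tensor(tensor, core_header_len=2, csix_len=3)
```

Every tensor stored in the ingress queue therefore has the same length, regardless of how much of it carries valid data.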



In the system architecture, the scheduling and arbitration arrangement is
distributed and occurs at two points: in the controller SM (between switch ports
and between priorities) and in the router (between traffic managers) SRS/A. Figure
7 is a conceptual diagram showing only the scheduling/arbitration functions across
the data switch from the traffic managers TM through the common interface CSIX
to the routers SR and the controller SCM into the switching matrix SC port. The
diagram also shows how information on channel, link bandwidth allocation and
switch efficiency, queue status, backpressure and traffic congestion management
is handled by the referenced arrows.



The controller SM provides the overall control function of the system.
When the routers request connections from the controller, they identify their
requested switching matrix connection by switch port and priority. The controller
then selects combinations of connections in the switching matrix to make best use
of the matrix connectivity and to provide fair service to the routers. This is
accomplished by using an arbitration mechanism. The controller SM can also
enforce pseudo-static bandwidth allocation across the priorities and ingress/egress
switch port combinations. For example, an external system controller can
guarantee a proportion of the available bandwidth to each of the priorities and to
specific connections. Unused allocations will be fairly shared between other
priorities and connections.
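The arbitration mechanism itself is not detailed at this point. As a rough illustration of what selecting a combination of connections means, the sketch below applies a simple greedy pass that serves requests in priority order and never uses an ingress or egress switch port twice in one cycle. This is a substituted textbook technique, not the controller's actual algorithm, and the convention that priority 0 is the highest is an assumption.

```python
def arbitrate(requests):
    """Greedy connection selection for one switching cycle.

    requests: iterable of (ingress_port, egress_port, priority) tuples,
              where priority 0 is taken to be the highest (assumption).
    Returns {ingress_port: egress_port}; the result is maximal in the sense
    that no remaining request could be added without reusing a port.
    """
    grants = {}
    used_egress = set()
    for ingress, egress, _priority in sorted(requests, key=lambda r: (r[2], r[0], r[1])):
        if ingress not in grants and egress not in used_egress:
            grants[ingress] = egress
            used_egress.add(egress)
    return grants

requests = [(0, 1, 1), (1, 1, 0), (0, 2, 2), (2, 0, 1)]
grants = arbitrate(requests)
```

Here ingress port 1 wins egress port 1 because of its higher priority, and ingress port 0 falls back to its lower-priority request for egress port 2. A fixed sort order like this one is not fair over time; a real arbiter would rotate or randomise the order between cycles.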

The controller SM also has a 'best effort' mechanism to dynamically bias
the arbitration in favour of long queues for applications that do not require strict
bandwidth enforcement.

The routers provide an aggregation function for multiple traffic managers
into a single switch port. When the controller SM grants a connection to a
particular egress switch port through a particular priority, the appropriate router
must choose one of up to eight unicast and one multicast traffic manager queues
to service. This is accomplished through a weighted round-robin mechanism,
which can select a queue based on a combination of factors. Ingress queue length
may allow for favouring of long queues over shorter ones, and a traffic
manager may temporarily increase the weighting of a queue via the urgency field in
the CSIX header. Queue bandwidth allocation is also a factor, determined by the
external system controller or by dynamic intervention. Finally, target congestion
management and traffic shaping are features taken into account. The sensitivity of
the weighting function to these parameters is determined for each priority and,
together with the bandwidth allocations, may be altered dynamically.
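The weighting function is not specified, so the following is only a sketch under stated assumptions: a linear weight combining queue length, urgency and bandwidth allocation, with hypothetical sensitivity parameters, and a rotating start position so that equal weights are served in turn. It is a simplified weight-based selection with round-robin tie-breaking, not the device's actual weighted round-robin scheduler.

```python
from dataclasses import dataclass

@dataclass
class IngressQueue:
    name: str
    length: int                 # tensors waiting in the queue
    urgency: int = 0            # from the CSIX header urgency field
    allocation: float = 1.0     # bandwidth allocation factor (assumed scale)
    congested: bool = False     # target egress queue has signalled stop

def weight(q, length_sens=1.0, urgency_sens=1.0):
    """Hypothetical linear weighting; empty or blocked queues get weight 0."""
    if q.length == 0 or q.congested:
        return 0.0
    return q.allocation * (1.0 + length_sens * q.length + urgency_sens * q.urgency)

def select_queue(queues, start=0, **sensitivities):
    """Return the index of the heaviest queue, scanning from `start` so ties rotate."""
    best, best_weight = None, 0.0
    for offset in range(len(queues)):
        i = (start + offset) % len(queues)
        w = weight(queues[i], **sensitivities)
        if w > best_weight:
            best, best_weight = i, w
    return best
```

Setting length_sens to zero for a priority would make the choice depend only on urgency and allocation, which is one way to picture the per-priority sensitivities mentioned above.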

The system implements three levels of backpressure, described in more
detail below. These are flow, traffic management and core backpressure. Flow level
backpressure occurs between ingress and egress traffic managers for the purposes
of congestion management and traffic shaping. Backpressure signalling at this
level is accomplished by the traffic managers sending packets of data through the
system and hence is beyond the scope of this document. Flow level backpressure
packets will appear to the system to be no different than data packets and as such,
are transparent.



So far as traffic level backpressure is concerned, the system is organised
to manage its data-flow at the traffic manager level of granularity (with four
priorities). Further granularity is achieved at the traffic manager itself. An egress
traffic manager can send backpressure information to the egress side of the router
over the CSIX interface, multiplexed with the datastream. Since the egress side of
the router has just a single queue per traffic manager, this is just a one-bit signal.
Backpressure between routers is signalled via a dedicated broadcast mechanism in
the switching controller and switching matrix. There are a number of thresholds
in the egress buffer queues. When a threshold is crossed, the egress router signals
the controller with a backpressure broadcast request. In the controller, such a
request stalls the arbiter at the end of the current cycle and the controller issues a
one vector broadcast connection to the switching matrix planes and informs the
requesting egress router. The egress router then sends one vector's worth (10 bytes)
of egress buffer status through the matrix to the ingress routers. The controller then
continues the interrupted cycles. In the event of several egress routers
simultaneously requesting a backpressure broadcast, the controller will satisfy all
the requests in a simple round-robin manner before resuming normal service. The
latency introduced in the backpressure mechanism due to this contention does not
affect the egress buffering since during this period a router will only be receiving
backpressure data from other routers, which does not need to be queued.



An egress router will aggregate the threshold transitions from all its egress
queues which have occurred during a switch cycle into one backpressure broadcast,
so that the maximum number of backpressure broadcasts between two tensors is
limited to the number of routers. When an ingress router receives a
backpressure broadcast vector of the form shown in Figure 8, it uses it to update
the ingress queue weightings as appropriate.
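The aggregation of threshold transitions can be pictured with a small sketch. The threshold values, the function names and the representation of queue state as a list of levels are assumptions; the specification only states that several thresholds exist and that all transitions in one switch cycle are reported in a single broadcast.

```python
import bisect

def level(occupancy, thresholds):
    """Number of thresholds that the queue occupancy has reached or passed."""
    return bisect.bisect_right(thresholds, occupancy)

def backpressure_update(previous_levels, occupancies, thresholds):
    """Evaluate all egress queues once per switch cycle.

    Returns (needs_broadcast, new_levels). However many queues crossed a
    threshold during the cycle, at most one broadcast request results.
    """
    new_levels = [level(o, thresholds) for o in occupancies]
    return new_levels != previous_levels, new_levels

thresholds = [4, 8, 12]                    # illustrative values, sorted ascending
request, levels = backpressure_update([0, 0], [5, 9], thresholds)
repeat, _ = backpressure_update(levels, [5, 9], thresholds)
```

In the first call both queues cross a threshold but only one broadcast is requested; in the second call nothing has changed, so no request is raised.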



Two modes of backpressure signalling between egress and ingress routers
are supported, namely start/stop and multi-state signalling. Multi-state signalling
allows the egress router to signal the multi-bit state of all its queues (1 byte per
queue). This multi-state backpressure signalling coupled with weighted-round-robin
scheduling in the ingress routers minimises the probability of egress queues
being full, which is significant when attempting to forward multicast or broadcast
traffic in a heavily utilised switch.



The ingress router signals stop/start backpressure to the ingress traffic
managers via the CSIX interface. This provides a 16-bit backpressure signal to
allow the ingress router to identify the ingress queue to which the signal relates.
Egress queue thresholds are set globally, whilst ingress queue thresholds are set
per queue.




The controller does not keep track of the state of the egress router buffers.
However, core level backpressure is in place across the router/controller interface
to prevent egress buffer overflows, by preventing the controller from scheduling
any traffic to a particular egress router in the event that all of the respective
router's buffers are full.

Multicast in this system chipset is implemented through the optimal
replication of tensors at ingress and egress. An ingress router has one multicast
queue per egress router per priority. An ingress multicast tensor (see Figure 8) is
created in each of the appropriate queues with the egress multicast masks in the
target fields TM of the core headers. Each tensor, of which three are shown, has a
length equal to 6 or 8 vectors and a width of 10 bytes. A backpressure vector BPV
may be inserted between adjacent tensors as shown. The multicast tensors are then
forwarded through the core in the same way as for unicast and the egress routers
then replicate the tensors into the required egress buffers in parallel. This multicast
mechanism is intended to provide optimum switch performance with a mix of
unicast and multicast traffic. In particular, it maintains the efficiency and fairness
of the scheduling and arbitration allowing the switch to provide consistent quality
of service.

The system provides a loss-less fabric, therefore multicast tensors cannot be
forwarded through the switching matrix unless none of the destination queues is
full. In a heavily utilised switch, if only stop/start backpressure from the egress
queues was implemented, this could severely restrict the useable bandwidth for
multicast traffic. Two mechanisms are included in the system to improve its
multicast performance. These are: 1. multi-state backpressure from the egress
router, which reduces the probability of egress queues being full, and 2. increasing
the weighting of the multicast ingress queues in the weighted-round-robin
scheduler when they have been blocked, to increase their chances of being
scheduled when the block clears. To avoid multicast (and broadcast) being blocked
by off-line egress ports, the backpressure signals can be individually masked out
by an external system controller via the Fabric Management Interface (FMI).



The requirement for wire-speed broadcast (benchmarking) is met by having
a single on-chip broadcast queue in each egress router. When the controller
schedules a broadcast connection, the tensor will be routed in the switching matrix
to all routers in parallel, thus avoiding any ingress congestion (no tensor replication
at ingress). Broadcast backpressure is provided by having each router inform the
controller when it transitions into or out of the state "all-egress-buffers-not-full".
The controller will only schedule a broadcast when all egress buffers in all routers
are not full. Broadcast backpressure is a configurable option. If it is not activated,
the routers do not send status messages and the controller schedules broadcasts on
demand. Using this method there is no guarantee that the packet will be forwarded
on all ports.
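The broadcast backpressure rule amounts to a simple gate in the controller, which can be sketched as below. The class and method names are hypothetical, and the initial assumption that every router starts in the "all-egress-buffers-not-full" state is ours rather than the specification's.

```python
class BroadcastGate:
    """Tracks each router's 'all-egress-buffers-not-full' state (illustrative)."""

    def __init__(self, routers, backpressure_enabled=True):
        self.backpressure_enabled = backpressure_enabled
        # Assumed initial state: every router starts with no full buffer.
        self.all_not_full = {router: True for router in routers}

    def on_transition(self, router, all_not_full):
        """Called when a router reports moving into or out of the state."""
        self.all_not_full[router] = all_not_full

    def may_schedule_broadcast(self):
        """Broadcast only if every router can accept it, unless the option is off."""
        if not self.backpressure_enabled:
            return True   # on-demand broadcast, delivery on all ports not guaranteed
        return all(self.all_not_full.values())
```

With the option disabled the gate always opens, which corresponds to the on-demand mode in which forwarding on all ports is not guaranteed.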



The switching matrix is shown in schematic form in Figure 9. It comprises
a high-speed, edge-clocked, synchronous, 16 port dual plane serial cross-point
switch SCN for use in the system. It has been optimised to provide a scaleable,
high bandwidth, low latency data movement capability. It operates under the
control of the controller SM, which sends configuration information to the matrix
over the controller interface SMI to create connections for the transmission of data
between routers. The buffer and decode logic BDL receives this information and
uses it to control the interconnections within the matrix. Data is applied in serial

field to connect ingress and egress ports. For 4 and 8 port configurations, the
number of bits of the control port field required is 2 and 3 respectively.

In operation, the switching matrix receives configuration information from
the controller SM via the controller interface SMI. This information is loaded into,
and stored in, configuration registers. Routing information is passed in the form of
a number of encoded fields determining which input port is to be connected to each
output port via the switching matrix. In a 16 x 16 matrix, there are 16 output ports.
For each output port there is a four bit source address which is encoded to define
which input port is to be connected to an output port. There is also an enable signal
for each field to signal that the field is valid and a configure signal that indicates
that the whole interface is valid. If a field is signalled as not valid, the output port
for that field is not connected. If the configure signal is not asserted, the matrix
does not change its current configuration. The configuration information on the
controller/matrix interface is loaded into the device when the configure signal is
asserted. A 16-stage programmable pipeline is used to delay the configuration
information until it is required for switching the matrix. If there is a parity error on
a port then that port's enable signal will be set to zero and a null tensor will be
transmitted to the output of that port. The register that holds the parity error may
only be loaded when the configure signal is high and is cleared when read by the
diagnostic unit. A parity check is also carried out on the configure signal. If a
parity error occurs here then a parity fail condition is asserted, all port enable
signals are set to zero and all the output ports on the device will transmit null
tensors. The connection between the routers and the matrix is via a set of serial data
streams, each running at one Gbaud. Once a connection across the matrix has been
set up, tensors are transmitted between ingress and egress routers. The whole
process exhibits low latency due to a very small insertion delay. Multiple switching
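
The configuration rules for the 16 x 16 matrix described above can be sketched
as follows. This is an illustrative model only. The function name, the tuple
layout of each field and the use of None to stand for an unconnected output
(one that transmits null tensors) are assumptions of the sketch, as is the
choice to test the configure signal before its parity. The 16-stage delay
pipeline is omitted.

```python
NUM_PORTS = 16

def apply_configuration(current, fields, configure, configure_parity_ok=True):
    """Return the new connection map of the matrix.

    current -- list of 16 entries; entry i is the input port feeding
               output port i, or None if output i is not connected
    fields  -- list of 16 (source, enable, parity_ok) tuples, one per output
    configure -- True when the whole interface is signalled as valid
    """
    if not configure:
        # Configure not asserted: the current configuration is kept.
        return list(current)
    if not configure_parity_ok:
        # Parity fail on the configure signal: every enable is forced to zero.
        return [None] * NUM_PORTS
    new = []
    for source, enable, parity_ok in fields:
        if enable and parity_ok:
            new.append(source & 0xF)  # four bit source address
        else:
            new.append(None)          # field invalid or port parity error
    return new
```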
There is also another mechanism in the controller/router interface referred
to as 'core level backpressure', which prevents the controller from scheduling any
traffic to a particular egress router. A router uses core level backpressure when all
its egress buffers are full.

The controller is capable of establishing both unicast and broadcast
connections in the switching matrix. It is also capable of dealing with system
configurations that contain a mixture of 'full' and 'half speed' ports, for example
a mixture of 10Gbit/sec and 5Gbit/sec routers.

Figure 12 shows a router device. This is a system port interface control
device. Its main function is to support user applications' data movement
requirements by providing access into and out of the system. There are two
instances of the ingress interface unit IIU, one for each of the traffic managers that
can be connected to a system port. The IIU is responsible for transferring data from
a traffic manager into an internal FIFO queue on the router and informing the ICU
that it has tensors ready to transmit into the system. The external interface to the
traffic manager utilises common system interface CSIX. This defines an n x 8-bit
data bus; the ingress interface units IIU operate in a 32-bit mode. The FIFO is four
tensors deep to allow one tensor to be transferred to the ICU while subsequent ones
are being received.



To generate the tensors, the ingress interface unit appends a three byte
system core header to the CSIX frame prior to passing it, indicating the tensor's
availability to the ICU. The IIU examines the CSIX header to determine whether
the frame is of type unicast, multicast or broadcast and indicates the type to the
ICU. If the frame is unicast, the IIU sets a single bit in byte 1 indicating the
destination TM; this is derived from the destination address in the CSIX header. If
the frame is multicast, a tensor is constructed and sent for each of the 16 system
ports that have a non-zero CSIX mask. In the case of a broadcast CSIX frame, byte
1 is set to all 1s by the IIU. The IIU is also responsible for calculating the two byte
Tensor Error Check which utilises a cyclic redundancy check.
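
The handling of byte 1 of the core header for the three frame types might be
sketched as below. The function names are hypothetical, and the mapping of
traffic manager number to bit position is an assumption; the description says
only that a single bit is set for unicast and that the byte is all 1s for
broadcast. The Tensor Error Check is not modelled because the CRC parameters
are not given.

```python
def mask_byte(frame_type, dest_tm=None):
    """Return the value of byte 1 for a unicast or broadcast frame."""
    if frame_type == "unicast":
        # Assumption: bit n of the byte selects traffic manager n.
        return 1 << dest_tm
    if frame_type == "broadcast":
        return 0xFF
    raise ValueError("multicast frames are expanded per system port")

def multicast_tensors(csix_masks):
    """Expand a multicast frame into one tensor per system port.

    csix_masks -- sequence of 16 masks, one per system port; a tensor is
                  constructed only for ports whose mask is non-zero
    Returns a list of (port, mask) pairs.
    """
    return [(port, m) for port, m in enumerate(csix_masks) if m != 0]
```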

Traffic manager flow control is provided by making each ingress interface
unit IIU responsible for signalling traffic manager start/stop backpressure
information to its associated egress interface unit EIU. The IIU obtains this
backpressure information by decoding the CSIX control bus. If parity error
checking has been enabled (the appropriate bit in a status register is set) and the IIU
detects a parity error on CSIX, then an error is logged and the corresponding tensor
discarded. This log can be retrieved via the FMI.

The ingress control unit ICU is responsible for accepting tensors from the
ingress interface units IIUs, making connection requests to the controller interface
unit SMIU, storing tensors until the controller grants a connection and then sending
tensors to transceivers TXR. There are two types of connection requests (and
subsequent grants). One is used for all unicast/multicast tensors and the second is
used for broadcast traffic. For unicast/multicast tensors the ingress control
unit/controller interface unit signalling incorporates the system destination port and
priority. Clearly for broadcast tensors there is no requirement for a system
destination address and since there is only one level of broadcast tensors, a priority
identifier is also not required.

Ingress buffering is illustrated in Figure 13. This buffering for unicast
queuing UQ is implemented such that there is one for each possible destination
traffic manager and priority. In addition to unicast queues, there is a multicast
queue MQ per port per priority and a single broadcast queue BQ. The queues are
statically allocated. There are 512 unicast, 64 multicast and one broadcast queue.
The unicast and multicast queues are located in external SRAM. The queue
organisation allows flow control down to OC-12 granularity. Within the unicast
address field of the CSIX header, 3 bits are allocated for the number of traffic
managers a router can support. Since the router supports two traffic managers, the
spare bit field is used for a function known as Service Channel. Service Channels
provide the means of fully exploiting the router's implicit OC-12 granularity
features.
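
The queue totals quoted above are consistent with the following arithmetic,
which is offered as one possible reading rather than a statement of the actual
design. It assumes 16 system ports and four priorities, as mentioned elsewhere
in the description, and treats every code of the 3-bit field as selecting a
separate unicast queue. The constant names are invented for this sketch.

```python
PORTS = 16            # system ports (assumed from the 16 port matrix)
PRIORITIES = 4        # priority levels (assumed)
TM_FIELD_BITS = 3     # bits allocated in the CSIX unicast address field
TMS_PER_ROUTER = 2    # traffic managers actually supported per router

codes_per_port = 2 ** TM_FIELD_BITS                      # 8 codes per port
service_channels = codes_per_port // TMS_PER_ROUTER      # 4 per traffic manager

unicast_queues = PORTS * codes_per_port * PRIORITIES     # 16 * 8 * 4
multicast_queues = PORTS * PRIORITIES                    # 16 * 4

print(unicast_queues, multicast_queues, service_channels)
```

On this reading there are eight unicast queues for any one egress port and
priority, two traffic managers with four Service Channels each.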

When the ingress control unit ICU receives a connection grant signal from
the controller interface unit SMIU (which specifies egress port and priority), the
ICU must choose one of up to 8 qualifying unicast queues or the multicast queue
from which to forward a tensor. This is achieved using a weighted round-robin
mechanism that takes into account several parameters. One is the ingress queue
length, which allows for the favouring of longer queues over shorter ones and
another is aggregate queue tensor urgency, which allows a traffic manager to
temporarily increase the weighting of a queue via the urgency field in the CSIX
header. One further parameter taken into account is queue bandwidth allocation,
whereby an external system controller or system operator can configure the system
to provide bandwidth allocation to individual flows via the FMI. The final
parameter considered is that of target egress queue backpressure. This is included
because the effective performance of the multicast scheme requires that the
probability of egress queues being full be minimised. The sensitivity of the
weighting function to the input variables is controlled by four sets of global
sensitivity variables (one per priority). These settings are configured at system
initialisation.
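
A weighted round-robin selection of this kind might be sketched as follows.
The description does not give the weighting function, so the linear
combination used here, the class and field names, and the use of a
credit-based ("smooth") weighted round-robin are all assumptions made purely
for illustration.

```python
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    length: int = 0        # tensors waiting in the ingress queue
    urgency: int = 0       # aggregate urgency of the queued tensors
    bandwidth: int = 1     # configured bandwidth allocation
    backpressure: int = 0  # state of the target egress queue (0 = emptiest)
    credit: int = 0        # running credit used by the round-robin

@dataclass
class Sensitivity:
    """One set of global sensitivity variables (there is one set per priority)."""
    length: int = 1
    urgency: int = 1
    bandwidth: int = 1
    backpressure: int = 1

def weight(q, s):
    # Assumed form: longer, more urgent, better provisioned queues gain
    # weight; backpressure from the target egress queue reduces it.
    w = (s.length * q.length + s.urgency * q.urgency
         + s.bandwidth * q.bandwidth - s.backpressure * q.backpressure)
    return max(w, 1)

def select(queues, s):
    """Pick the queue to serve for one grant, or None if all are empty."""
    candidates = [q for q in queues if q.length > 0]
    if not candidates:
        return None
    total = 0
    for q in candidates:
        w = weight(q, s)
        q.credit += w
        total += w
    chosen = max(candidates, key=lambda q: q.credit)
    chosen.credit -= total
    chosen.length -= 1  # one tensor is forwarded
    return chosen
```

With only the bandwidth term enabled, two busy queues with allocations of 3
and 1 are served in a 3:1 ratio over successive grants.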

To provide an ingress flow control mechanism, the ingress control unit ICU
implements three watermark levels to indicate the state of the queues (fairly empty,
filling up, fairly full or very full). The watermarks have associated hysteresis and
both values are configurable via the FMI. When a queue moves from one state to
another, the ICU signals the change to each of the egress interface units EIUs. In
addition to this 'multistate' backpressure mechanism it is also possible to invoke
a second mode of backpressure signalling that involves only start/stop signalling.
The backpressure mechanism mode is selected via the FMI by setting the
watermark levels appropriately.
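
A minimal sketch of three watermarks with hysteresis is given below. The class
name, the use of a single hysteresis value for all three watermarks and the
exact comparison rules are assumptions; the description states only that the
watermarks and their hysteresis are configurable.

```python
STATES = ("fairly empty", "filling up", "fairly full", "very full")

class WatermarkMonitor:
    def __init__(self, watermarks, hysteresis):
        # Three ascending watermark levels separate the four states.
        assert len(watermarks) == 3 and list(watermarks) == sorted(watermarks)
        self.watermarks = list(watermarks)
        self.hysteresis = hysteresis
        self.state = 0

    def update(self, depth):
        """Return the new state name if the state changed, otherwise None."""
        old = self.state
        # Move up when the depth reaches the next watermark.
        while self.state < 3 and depth >= self.watermarks[self.state]:
            self.state += 1
        # Move down only once the depth has fallen below the watermark
        # by more than the hysteresis.
        while self.state > 0 and depth < self.watermarks[self.state - 1] - self.hysteresis:
            self.state -= 1
        return STATES[self.state] if self.state != old else None
```

In this model, setting all three watermarks to the same level leaves only two
reachable states, which is one plausible way of obtaining start/stop behaviour
from the same mechanism.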



The egress control unit ECU signals egress backpressure information to the
ICU. This information relates either to the signalling egress router's buffers or to
information the ECU has received about the state of another egress router's buffers.
If the information relates to the signalling egress router's buffers, the ICU updates
the backpressure status used by the ingress scheduling algorithm and makes a
request to send backpressure information to the controller interface unit SMIU. If
it relates to another egress router's buffers, then the ICU simply updates its own
backpressure status.



There are two instances of the egress interface unit EIU, one for each of the
traffic managers that are connected to a system port. The egress interface unit is
responsible for accepting tensors from the ECU and transmitting them as frames
over CSIX to the associated traffic manager.



To provide traffic manager flow control, the egress interface unit EIU
accepts traffic manager start/stop backpressure information from its associated
ingress interface unit IIU (that is, the one connected to the same traffic manager).
If the EIU is currently sending a frame to the traffic manager, then it continues the
transfer of the current frame and then waits until a start indication is received
before transferring any subsequent frames.
To provide ingress flow control the EIU accepts ingress buffer multistate
backpressure information from the ICU and sends it immediately to the traffic
manager.



The egress control unit ECU is responsible for accepting tensors from the
serial transceivers, when informed by the controller interface unit SMIU of their
imminent arrival, and forwarding them to the relevant EIU. The ECU examines the
traffic manager mask byte of the system core header to determine the correct
destination EIU. In the case of multicast (or broadcast) tensors, multiple bits are
set in the mask and the tensors are simultaneously transferred to all the EIUs for
which a corresponding bit is set. This feature provides wire speed multicasting at
the egress router. The ECU is responsible for checking the tensor error check bytes
of the system core header. If the system core error checking has been enabled (i.e.
the appropriate bit in a status register is set) and the ECU detects an error, then it
is logged and the corresponding tensor discarded. To provide an egress flow
control mechanism the ECU implements three watermark levels to indicate the
state of the egress buffers (fairly empty, filling up, fairly full or very full). When
an egress buffer moves from one state to another the ECU signals the change to the
ICU. The level of the watermarks is configurable via the FMI. In addition to this
multistate backpressure mechanism it is also possible to invoke a second mode of
backpressure signalling that involves only start/stop signalling. The type of
backpressure mechanism is selected via the FMI by setting the watermark levels
appropriately.
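
The egress routing decision taken from the traffic manager mask byte might be
sketched as below. The function name, the bit-to-EIU mapping and the boolean
error flag are assumptions made for illustration; logging of the error is not
modelled.

```python
def route_at_egress(tm_mask, error_detected, checking_enabled, num_eius=2):
    """Return the list of EIU indices that should receive the tensor."""
    if checking_enabled and error_detected:
        # The error would be logged and the tensor discarded.
        return []
    # Assumption: bit i of the mask byte selects EIU i. Multicast and
    # broadcast tensors have several bits set and go to every matching EIU.
    return [i for i in range(num_eius) if tm_mask & (1 << i)]
```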



The controller interface unit SMIU is responsible for controlling the
interface to the controller. Since the controller operates at the system port rather
than the traffic manager port level of granularity, the SMIU also operates at this
level. The SMIU maintains a count of the number of tensors in the ingress queues
associated with each destination system port. The count is incremented each time
the SMIU is informed of a tensor arrival by the ICU and decremented each time the
SMIU receives a grant from the controller.

The controller interface unit SMIU contains a state machine that is tightly
coupled to a corresponding one in the controller. For small numbers of tensors
(less than about six or seven), the SMIU notifies the controller of each new tensor
arrival. For larger numbers of tensors, the SMIU only informs the controller when
the count value crosses predefined boundaries.
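
The reporting behaviour for one destination system port can be sketched as
follows. The threshold of six and the boundary values are placeholders chosen
for the example, since the description gives only "about six or seven" and
"predefined boundaries". The class and method names are hypothetical, and the
sketch assumes that grants need no report because they originate from the
controller.

```python
class ArrivalCounter:
    """Per-destination tensor count kept by the SMIU (illustrative model)."""

    def __init__(self, small_limit=6, boundaries=(8, 16, 32)):
        self.small_limit = small_limit      # assumed threshold
        self.boundaries = set(boundaries)   # assumed boundary values
        self.count = 0

    def on_arrival(self):
        """Count a tensor arrival; return True if the controller is notified."""
        self.count += 1
        if self.count <= self.small_limit:
            return True                     # every arrival is reported
        return self.count in self.boundaries

    def on_grant(self):
        """Count a grant received from the controller."""
        if self.count > 0:
            self.count -= 1
```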

The central management unit is common to all devices. Its functions are
to provide a FMI between each device and an external controller, control error
management within the device and provide a reset interface and reference clocking
into each device.

Referring back to Figure 12, the routers provide access into the system via
CSIX ingress and egress interfaces. On receiving a CSIX packet from the ingress
traffic manager, over the CSIX interface ICSIX, the ingress interface unit IIU
checks the type and validity of the packet. The packet is then wrapped with a core
header, the contents of which vary with the packet type. When the core header has
been appended, the packet becomes known as a tensor. The ingress control unit
ICU makes a request to the controller through the controller interface SMI for a
connection across the switching matrix and stores the tensors until the connection
is created. In order to eliminate head of line blocking for unicast traffic, ingress
buffering is organised into separate queues, one for each possible destination traffic
manager TMQ1 to TMQN and priority P1 to P4 as shown in Fig. 13. Individual
queues per priority are not required to avoid head of line blocking but are
advantageous as they allow the controller to enforce bandwidth allocation to each
priority in the switch. In addition to the unicast queues there is a multicast queue
per port per priority and a single broadcast queue. The unicast and multicast
queues are statically allocated in external SRAM. The purpose of this level of
buffering is to allow the controller to allocate connections efficiently by giving it
a view of the ingress datastreams and to provide rate matching between the router
external interfaces and the router/matrix interface.
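
The way in which separate queues per destination remove head of line blocking
can be shown with a small model. The class below is a sketch only; its name,
the dictionary of queues and the string keys are assumptions, and the real
queues are statically allocated rather than created on demand.

```python
from collections import deque

class IngressQueues:
    """One FIFO per (destination traffic manager, priority) pair."""

    def __init__(self):
        self.queues = {}

    def enqueue(self, dest, priority, tensor):
        self.queues.setdefault((dest, priority), deque()).append(tensor)

    def dequeue(self, dest, priority):
        """Return the head of the selected queue, or None if it is empty."""
        q = self.queues.get((dest, priority))
        return q.popleft() if q else None

voq = IngressQueues()
voq.enqueue("TM_A", 1, "t1")   # arrives first, destination may be busy
voq.enqueue("TM_B", 1, "t2")   # arrives second, different destination
# A grant for TM_B can be served without waiting for t1 to leave.
first_out = voq.dequeue("TM_B", 1)
```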



When connections are granted, the controller creates a connection across
the switching matrix to the requested egress router at a given priority. The ingress
control unit ICU must now choose one of the qualifying unicast or multicast queues
from which to forward a tensor to the transceiver for serialisation. This level of
router scheduling is done on a weighted-round-robin basis. Each unicast and
multicast queue has a weighting associated with it, which is determined by the
backpressure from the egress buffers, the queue length, the queue urgency and the
static bandwidth allocation. On the egress side the controller informs the router
of a tensor's imminent arrival. The egress control unit ECU receives this tensor and
examines the core header to see which traffic manager to send the tensor to.
Tensors are then assembled back into datastreams and forwarded via CSIX to the
appropriate traffic manager.

Multicasting in the system is achieved by the optimal replication of tensors
at the ingress and egress. On the ingress side a router has one multicast queue per
egress router at each priority. Multicast routing information is appended on the
ingress side and on arrival at the egress side these masks determine the replication
of tensors into the required egress buffers. Broadcast in the system is achieved by
having a single on chip broadcast queue at the ingress of each router. When the
controller schedules a broadcast connection, the tensor will be routed by the matrix
to all egress routers in parallel, thus avoiding any ingress congestion.

Each system device contains a logic block known as the fabric management
interface unit (FMIU). The FMIU interfaces to the functional logic, also known as
the core, within the device in order to provide run-time (read/write) access to a
chosen subset of the registers and RAM locations, a mechanism to report run-time
fail conditions detected by the device, and scan access (read/write) to the total set
of registers in the functional logic while the functional logic is not operational.

The external interface to the fabric management interface unit FMIU
requires a number of inputs, including a Hard Reset input which sets the system
device into a known state. In particular, it sets the device into a state where the
FMIU is fully functional and the serial interface can be used. Hard Reset is
expected to be applied when power is first applied to the device, and may also be
applied at other times. The external interface also has serial input and serial output
lines and a device locator address field used to identify a particular instance of a
device. The device locator field is generated by tie-offs that are determined by the
device's physical position in the system.

The main functions of the central management unit (CMU) shown in
Figure 12 include error detection and logging logic. This is responsible for
detecting error conditions and states within the chip or on its interfaces. As such,
its functionality is spread throughout the design and is not concentrated within a
specific block. Errors are reported and stored in the Error and Status registers and
logs, which are accessible across the FMI. The CMU also has reset and clock
generation logic responsible for the generation and distribution of clocks and reset
signals within the device. In addition, the CMU contains test control logic which
controls the mechanisms built in for chip test. The target fault coverage is 99.9%.
This logic is not used under normal operating conditions. The final function of the
CMU is to provide fabric management logic common to all of the system devices
CLAIMS

1. A method of handling packets of information through a data switch
comprising input traffic controllers, ingress routers, a memoryless cyclic switch
fabric, egress routers and output traffic controllers all under the control of a switch
controller and interconnected such that each input line connected to the data switch
is terminated on a traffic controller arranged to convert the input line protocol
information packets into fixed length cells having a header defining the data switch
destination router and output traffic controller together with message priority
information arranged such that each ingress router serves a group of traffic
controllers characterised in that the ingress router includes a set of input buffers
one for each input line and a set of virtual output queue buffers, one for each
output traffic controller from the data switch, and in which the method comprises
on the arrival of a cell from a traffic controller the ingress router examines the cell
header and places it in the appropriate virtual output queue and generates a request
for transfer message consisting of the destination traffic controller address and a
message priority code which is passed to the data switch controller, the switch
controller schedules the passage of the cells across the switch fabric by
interconnecting a specific ingress router to a specific egress router for each switch
fabric cycle in accordance with a first arbitration process the ingress router
selecting from the appropriate virtual output queue the cell at the head of the queue
for passage across the data switch to the appropriate output traffic controller in
accordance with a second arbitration process.



2. A method of handling packets of information through a data switch as
claimed in claim 1 characterised in that the ingress buffering is organised into
separate queues, one for each destination traffic controller and each priority level.
3. A method of handling packets of information through a data switch as
claimed in claim 1 or 2 characterised in that the ingress router uses a weighted
round-robin arbitration process to select the next queue buffer based upon ingress
queue length, aggregate queue packet urgency and target traffic controller egress
queue backpressure.

4. A method of handling packets of information through a data switch as
claimed in claim 1, 2 or 3 characterised in that the first arbitration process involves
determining the set of requests to be accepted for each switch fabric cycle attempts
to deliver a packet of information to each output switch fabric port in every
arbitration cycle.

5. A data switch for handling packets of information comprising input traffic
controllers, ingress routers, a memoryless cyclic switch fabric, egress routers and
output traffic controllers all under the control of a switch controller and
interconnected such that each input line connected to the data switch is terminated
on a traffic controller arranged to convert the input line protocol information
packets into fixed length cells having a header defining the data switch destination
router and output traffic controller together with message priority information
arranged such that each ingress router serves a group of traffic controllers
characterised in that the ingress router includes a set of input buffers one for each
input line and a set of virtual output queue buffers, one for each output traffic
controller connected to the data switch, and in which on the arrival of a cell from
a traffic controller the ingress router examines the cell header and places it in the
appropriate virtual output queue and generates a request for transfer message
consisting of the destination traffic controller address and a message priority code
which is passed to the data switch controller, the switch controller schedules the
passage of the cells across the switch fabric by interconnecting a specific ingress
router to a specific egress router for each switch fabric cycle in accordance with a
first arbitration process and the ingress router selects from the appropriate virtual
output queue the cell at the head of the queue for passage across the data switch to
the appropriate output traffic controller in accordance with a second arbitration
process.




6. A data switch for handling packets of information as claimed in claim 5
characterised in that the virtual output queues are arranged as separate queues one
for each destination traffic controller and each priority level.



7. A data switch for handling packets of information as claimed in claim 5 or
6 characterised in that the ingress router uses a weighted round-robin mechanism
to select the next queue buffer based on ingress queue length, aggregate queue
packet urgency and target traffic controller egress queue backpressure.

8. A data switch for handling packets of information as claimed in claim 5,
6 or 7 characterised in that the switch controller performs a first arbitration process
which involves determining the set of requests to be accepted for each switch fabric
cycle by attempting to deliver a packet of information to each output switch fabric
port in every arbitration cycle.



9. A method of handling packets of information through a switch as described
and shown in the accompanying drawings.

10. A data switch for handling packets of information as described and shown 
in the accompanying drawings. 